Text Similarity

State of the Art

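value for word $w_i$ in the document vector: $\text{BM25}(w_i,D)$
normalize the length of each document vector to 1
similarity $(\vec x,\vec y)$ = $\sum_{i=1}^N\text{IDF}(w_i)\times x(i)\times y(i)$, where $N$ is the vocabulary size (a larger value means the documents are more alike)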
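
The three steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the sparse-dictionary representation, the function names `normalize` and `idf_weighted_dot`, and the sample weights are all assumptions made for the example, and "length 1" is read here as Euclidean length.

```python
import math


def normalize(vec):
    """Scale a sparse vector (word -> weight) to Euclidean length 1.

    Assumption: "length" in the notes is taken to mean the L2 norm.
    """
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0:
        return dict(vec)  # an all-zero vector cannot be rescaled
    return {w: v / norm for w, v in vec.items()}


def idf_weighted_dot(x, y, idf):
    """Sum IDF(w) * x(w) * y(w) over the words present in both vectors.

    Words missing from either vector contribute zero, so only the
    shared words need to be visited. `idf` is assumed to cover every
    word that appears in the vectors.
    """
    shared = x.keys() & y.keys()
    return sum(idf[w] * x[w] * y[w] for w in shared)


# Hypothetical BM25 values and IDF weights, chosen only for illustration.
x = normalize({"text": 1.5, "mining": 1.0})
y = normalize({"text": 1.0, "topic": 2.0})
idf = {"text": 0.5, "mining": 2.0, "topic": 1.5}

score = idf_weighted_dot(x, y, idf)
print(score)
```

Only the shared word `text` contributes to the score here, and its low IDF weight reduces that contribution further.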

BM25: Term Frequency Weight

$\text{BM25}(w,D) = \frac{1+k} {1+k/\left( \frac{c(w)} {1+b \left(|D| - |\bar D|\right) / |\bar D|} \right)}$

$c(w)$ : count of word $w$ in document $D$
$|D|$ : length of document $D$
$|\bar D|$ : average document length
parameter $k\in [0,\infty)$ : sets the upper bound of the weight to $k+1$
parameter $b\in [0,1]$ : controls the strength of length normalization ($b=0$ turns it off)
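
As a rough sketch, the formula and its parameters translate directly into a small Python function. The name `bm25_tf` is hypothetical, and the defaults `k=1.2` and `b=0.75` are commonly used values rather than anything prescribed by these notes. A zero count is handled separately because the formula divides by the normalized count.

```python
def bm25_tf(count, doc_len, avg_doc_len, k=1.2, b=0.75):
    """BM25 term-frequency weight for one word in one document.

    count       -- c(w), occurrences of the word in the document
    doc_len     -- |D|, length of the document
    avg_doc_len -- average document length over the collection
    k, b        -- assumed defaults, not taken from the notes
    """
    if count == 0:
        return 0.0  # limit of the formula as the count goes to zero

    # Length normalizer: 1 for an average-length document,
    # larger for long documents, smaller for short ones.
    normalizer = 1.0 + b * (doc_len - avg_doc_len) / avg_doc_len

    x = count / normalizer  # length-normalized count
    return (1.0 + k) / (1.0 + k / x)


# One occurrence in an average-length document gives a weight of 1.
print(bm25_tf(1, 100, 100))

# The same raw count is worth less in a longer document.
print(bm25_tf(3, 100, 100), bm25_tf(3, 200, 100))

# Very large counts approach, but never reach, the bound k + 1.
print(bm25_tf(10_000, 100, 100))
```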

Illustration for $\frac{1+k}{1+k/x}$ :
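
A few reference points describe the shape of this curve. Here $x$ stands for the length-normalized count $c(w) / \left(1+b\left(|D|-|\bar D|\right)/|\bar D|\right)$ from the formula above, and $k>0$ is assumed:

$$\frac{1+k}{1+k/x} = \frac{(1+k)\,x}{x+k}, \qquad x \to 0 \;\Rightarrow\; 0, \qquad x = 1 \;\Rightarrow\; 1, \qquad x \to \infty \;\Rightarrow\; k+1$$

The function increases with $x$, but with diminishing returns: its derivative $\frac{k(1+k)}{(x+k)^2}$ is positive and shrinks as $x$ grows, so each additional occurrence of a word adds less weight than the previous one.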

IDF: Penalizing Popular Terms

Reference

Text Mining: https://www.coursera.org/learn/text-mining